Data Ingestion using Amazon Kinesis Data Streams with S3 Data Lake

This topic describes how to create a data ingestion pipeline that uses Amazon Kinesis Data Streams as the data source, Databricks for data integration, and Amazon S3 as the data lake.

Prerequisites

  • Access to a configured Amazon S3 bucket, which is used as the data lake in the pipeline.

  • A configured instance of Amazon Kinesis Data Streams. For information about configuring Kinesis, see Configuring Amazon Kinesis Data Streams.
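
Before building the pipeline, it can help to confirm that the prerequisite stream is reachable and in the ACTIVE state. The following is a minimal sketch using boto3, assuming AWS credentials are already configured; the stream name and region are placeholders, not values from this pipeline.

```python
# Readiness check for the prerequisite Kinesis stream.
# "my-ingest-stream" and "us-east-1" are placeholder values.
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

def stream_is_ready(stream_name: str) -> bool:
    """Return True if the stream exists and is ACTIVE."""
    try:
        summary = kinesis.describe_stream_summary(StreamName=stream_name)
    except kinesis.exceptions.ResourceNotFoundException:
        return False
    return summary["StreamDescriptionSummary"]["StreamStatus"] == "ACTIVE"

print(stream_is_ready("my-ingest-stream"))  # placeholder stream name
```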

Creating a data ingestion pipeline

  1. On the home page of Data Pipeline Studio, add the following stages and connect them as shown below:
    • Data Source: Amazon Kinesis Data Streams
    • Data Integration: Databricks
    • Data Lake: Amazon S3

    Kinesis Data Streams pipeline with Amazon S3 data lake

  2. Configure the Kinesis node and Amazon S3 node.

  3. Click the Databricks node and click Create Job, and then complete the steps to create the data integration job.
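
For context, a Kinesis-to-S3 integration job on Databricks typically amounts to a structured streaming read from the Kinesis stream and a streaming write to the data lake. The sketch below shows that general pattern, not the exact job that Data Pipeline Studio generates; it assumes it runs in a Databricks notebook (where `spark` is predefined), and the stream name, region, and S3 paths are placeholders.

```python
# General Kinesis-to-S3 streaming pattern on Databricks; all names are placeholders.
from pyspark.sql.functions import col

raw = (
    spark.readStream
         .format("kinesis")                         # Databricks Kinesis connector
         .option("streamName", "my-ingest-stream")  # placeholder stream name
         .option("region", "us-east-1")             # placeholder region
         .option("initialPosition", "latest")       # start from new records
         .load()
)

# The Kinesis source delivers the record payload as binary in the `data` column.
events = raw.select(
    col("data").cast("string").alias("payload"),
    col("approximateArrivalTimestamp"),
)

(events.writeStream
       .format("delta")
       .option("checkpointLocation", "s3://my-data-lake/_checkpoints/kinesis")  # placeholder
       .start("s3://my-data-lake/raw/kinesis_events"))                          # placeholder
```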

Running the data ingestion pipeline

After you have created the data integration job with Amazon Kinesis Data Streams, ensure that you publish the pipeline. If you haven't already done so, click Publish. You can then run the pipeline as follows:

    1. Click Run Kinesis Data Stream pipeline. The Data Streams window opens, listing the data streams in the pipeline. Enable the toggle for the stream that you want to use to fetch data.

      Kinesis Data Streams pipeline with S3 data lake

    2. Click the Databricks node and then click Start to run the data integration job. Navigate to Data Streams and click Refresh. Kinesis streaming is now enabled.

      Data Integration job start

    You can see that the data stream that you enabled is now running. Click the refresh icon to view the latest information about the number of events processed.
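
    If the enabled stream has no incoming traffic yet, the event count stays at zero. The following is a minimal boto3 sketch for pushing a few JSON test records into the stream so the pipeline has events to process; the stream name and region are placeholders.

```python
# Push a few JSON test records into the enabled stream.
# Stream name and region are placeholder values.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

for i in range(5):
    kinesis.put_record(
        StreamName="my-ingest-stream",                     # placeholder
        Data=json.dumps({"event_id": i}).encode("utf-8"),  # record payload
        PartitionKey=str(i),                               # spreads records across shards
    )
```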

  • Troubleshooting a failed data integration job

    When you click the Databricks node in the pipeline, the job status tells you whether your data integration job has failed.

    1. Click the Databricks node in the pipeline.

    2. Check the status of the Databricks data integration job. The status can be one of the following:

      • Running

      • Canceled

      • Pending

      • Failed

    3. If the job status is Failed, click the ellipsis (...) and then click Open Databricks Dashboard.

      Troubleshooting Kinesis data stream job

    4. You are redirected to the specific Databricks job, which shows the list of job runs. Click the job run for which you want to view the details.

      Databricks Data Integration job list

    5. View the details and check for errors.

      Databricks job details troubleshooting
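
    If you prefer to check run states without opening the dashboard, the Databricks Jobs REST API (version 2.1) exposes the same lifecycle and result states. The following is a minimal sketch using the requests library; the workspace URL, access token, and job ID are placeholders for your own values.

```python
# Check recent run states for a Databricks job via the Jobs API 2.1.
import requests

HOST = "https://my-workspace.cloud.databricks.com"  # placeholder workspace URL
TOKEN = "dapi-example-token"                        # placeholder access token
JOB_ID = 12345                                      # placeholder job ID

resp = requests.get(
    f"{HOST}/api/2.1/jobs/runs/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"job_id": JOB_ID, "limit": 5},
)
resp.raise_for_status()

for run in resp.json().get("runs", []):
    state = run["state"]
    # result_state appears only after a run finishes (e.g. SUCCESS, FAILED)
    print(run["run_id"], state["life_cycle_state"], state.get("result_state"))
```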

    What's next? Data Ingestion using Amazon Kinesis Data Streams with Snowflake Data Lake